Pandas Merging

In [1]:
# Import pandas
import pandas as pd
In [13]:
user_usage = pd.read_csv("user_uses.csv")
user_device = pd.read_csv("user_device.csv")
devices = pd.read_csv("supported_devices.csv")
devices.rename(columns={"Retail Branding": "manufacturer"}, inplace=True)
In [14]:
user_device.head()
Out[14]:
use_id user_id platform platform_version device use_type_id
0 22782 26980 ios 10.2 iPhone7,2 2
1 22783 29628 android 6.0 Nexus 5 3
2 22784 28473 android 5.1 SM-G903F 1
3 22785 15200 ios 10.2 iPhone7,2 3
4 22786 28239 android 6.0 ONE E1003 1
In [16]:
devices.head(5)
Out[16]:
manufacturer Marketing Name Device Model
0 NaN NaN AD681H Smartfren Andromax AD681H
1 1&1 10.or 1&1 Puck 10or_G2 FJL21\nhws7721g diw362_1u1 G2 FJL21\nMediaPad 7 Youth 2 DIW362P 1U1\n10or G2
2 10.or D 10or_D D
3 10.or E E E
4 10.or G G G

Merge user_usage with user_device.

We're trying to get the average usage for different types of devices. So we need to get the user's device code from user_usage as a column on user_usage, and then get the device's manufacturer from devices as a column on the result.

First, we merge user_usage with user_device with "use_id" as our common column

In [17]:
result = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id')
result.head()
Out[17]:
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id platform device
0 21.97 4.82 1557.33 22787 android GT-I9505
1 1710.08 136.88 7267.55 22788 android SM-G930F
2 1710.08 136.88 7267.55 22789 android SM-G930F
3 94.46 35.17 519.12 22790 android D2303
4 71.59 79.26 1557.33 22792 android SM-G361F

Left merge example

A left merge, or left join, between two dataframes keeps all the records from the left dataframe, in this case "user_usage". Rows from the right dataframe will be kept in the result only where there is a match in the merge variable in the right dataframe, and NaN values will be in the result where not.

In [20]:
result_left = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id', how='left')
print("user_usage dimensions: {}".format(user_usage.shape))
print("result dimensions: {}".format(result.shape))
print("There are {} missing values in the result.".format(
        result['device'].isnull().sum()))
user_usage dimensions: (240, 4)
result dimensions: (240, 6)
There are 81 missing values in the result.
In [23]:
result_left.head()
Out[23]:
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id platform device
0 21.97 4.82 1557.33 22787 android GT-I9505
1 1710.08 136.88 7267.55 22788 android SM-G930F
2 1710.08 136.88 7267.55 22789 android SM-G930F
3 94.46 35.17 519.12 22790 android D2303
4 71.59 79.26 1557.33 22792 android SM-G361F

Right merge:

A right merge, or right join, between two dataframes keeps all of the records from the right dataframe, in this case "user_device". Rows from the left dataframe will be kept where there is a match in the merge variable, and NaN values will be in the result where it's not.

In [24]:
result_right = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id', how='right')
print("user_device dimensions: {}".format(user_device.shape))
print("result dimensions: {}".format(result.shape))
print("There are {} missing values in the 'monthly_mb' column in the result.".format(
        result['monthly_mb'].isnull().sum()))
print("There are {} missing values in the 'platform' column in the result.".format(
        result['platform'].isnull().sum()))
user_device dimensions: (272, 6)
result dimensions: (240, 6)
There are 0 missing values in the 'monthly_mb' column in the result.
There are 81 missing values in the 'platform' column in the result.
In [25]:
result_right.head()
Out[25]:
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id platform device
0 21.97 4.82 1557.33 22787 android GT-I9505
1 1710.08 136.88 7267.55 22788 android SM-G930F
2 1710.08 136.88 7267.55 22789 android SM-G930F
3 94.46 35.17 519.12 22790 android D2303
4 71.59 79.26 1557.33 22792 android SM-G361F

Outer merge:

A full outer join, or outer merge, keeps all rows from the left and right dataframe in the result. Rows will be aligned where there is shared join values between the left and right. Rows with NaN values, in either the left-originating or right-originating columns will be. it will be left in the result where there is no shared join value.

In [26]:
print("There are {} unique values of use_id in our dataframes.".format(
        pd.concat([user_usage['use_id'], user_device['use_id']]).unique().shape[0]))

result_outer = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id', how='outer', indicator=True)

print("Outer merge result has {} rows.".format(result.shape))

print("There are {} rows with no missing values.".format(
    (result_outer.apply(lambda x: x.isnull().sum(), axis=1) == 0).sum()))
There are 353 unique values of use_id in our dataframes.
Outer merge result has (240, 6) rows.
There are 159 rows with no missing values.
In [27]:
result_outer.head()
Out[27]:
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id platform device _merge
0 21.97 4.82 1557.33 22787 android GT-I9505 both
1 1710.08 136.88 7267.55 22788 android SM-G930F both
2 1710.08 136.88 7267.55 22789 android SM-G930F both
3 94.46 35.17 519.12 22790 android D2303 both
4 71.59 79.26 1557.33 22792 android SM-G361F both

Merging with Device manufacturer

In [42]:
# First merge user_usage and user_device to add the platform and device to the user usage.
result = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id',
                 how='left')

# Now, based on the "device" column in result, match the "Model" column in devices.
devices.rename(columns={"Retail Branding": "manufacturer"}, inplace=True)
result = pd.merge(result, 
                  devices[['manufacturer', 'Model']],
                  left_on='device',
                  right_on='Model',
                  how='left')

result.sample(5)
Out[42]:
outgoing_mins_per_month outgoing_sms_per_month monthly_mb use_id platform device manufacturer Model
28 135.09 42.02 5191.12 22833 android E6653 NaN NaN
3 94.46 35.17 519.12 22790 android D2303 NaN NaN
38 1221.85 69.20 6229.33 22858 android ONEPLUS A3003 NaN NaN
52 128.34 1.77 7469.79 22881 android HTC Desire 825 NaN NaN
182 583.73 216.52 1441.29 23548 NaN NaN NaN NaN
In [44]:
# You can apply filter on dataframe according to your reuirement.
devices[devices.Device.str.startswith('GT')]
Out[44]:
manufacturer Marketing Name Device Model
7 3Go GT10K3IPS GT10K3IPS GT10K3IPS
8 3Go\n3Go 3Q GT70053G\nGT7007EQC OC1020A GT70053G\nGT7007EQC OC1020A GT70053G\nGT7007EQC OC1020A